Residual branch




On residual network depth

Dherin, Benoit, Munn, Michael

arXiv.org Machine Learning

Deep residual architectures, such as ResNet and the Transformer, have enabled models of unprecedented depth, yet a formal understanding of why depth is so effective remains an open question. A popular intuition, following Veit et al. (2016), is that these residual networks behave like ensembles of many shallower models. Our key finding is an explicit analytical formula that verifies this ensemble perspective, proving that increasing network depth is mathematically equivalent to expanding the size of this implicit ensemble. Furthermore, our expansion reveals a hierarchical ensemble structure in which the combinatorial growth of computation paths leads to an explosion in the output signal. This insight offers a first-principles explanation for the historical dependence on normalization layers in training deep models and sheds new light on a family of successful normalization-free techniques such as SkipInit and Fixup. However, while these previous approaches infer scaling factors through optimizer analysis or a heuristic analogy to Batch Normalization, our work offers the first explanation derived directly from the network's inherent functional structure. Specifically, our Residual Expansion Theorem reveals that scaling each residual module provides a principled solution to taming the combinatorial explosion inherent to these architectures. We further show that this scaling acts as a capacity control that also implicitly regularizes the model's complexity.
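As a concrete illustration of the scaling idea discussed above, here is a minimal PyTorch sketch (our own, not the paper's construction) of a residual block whose branch is multiplied by a factor alpha; the choice alpha = 1/sqrt(depth) below is purely illustrative of how such a factor can keep the output from exploding as depth grows.

```python
import math
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    def __init__(self, width: int, alpha: float):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(), nn.Linear(width, width)
        )
        self.alpha = alpha  # residual-branch scaling factor

    def forward(self, x):
        return x + self.alpha * self.branch(x)

depth, width = 64, 256
alpha = 1.0 / math.sqrt(depth)  # illustrative choice, not the paper's prescription
net = nn.Sequential(*[ScaledResidualBlock(width, alpha) for _ in range(depth)])
x = torch.randn(8, width)
# With the scaling the output scale stays bounded; with alpha = 1 it grows
# rapidly as more blocks are stacked.
print(net(x).std())
```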


A Definition of a batch normalization layer

Neural Information Processing Systems

When applying batch normalization to convolutional layers, the inputs and outputs of normalization layers are 4-dimensional tensors, which we denote by I. For distributed training, the batch statistics are usually estimated locally on a subset of the training minibatch ("ghost batch normalization"). We now define the three models in full. The inputs first pass through a single fully connected linear layer of width 1000, using LeCun normal initialization [48] to preserve the variance in the absence of non-linearities. We then apply a series of residual blocks.
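For concreteness, a rough PyTorch sketch of the setup described in this excerpt; the exact contents of the residual branch and the input dimension (784 below) are assumptions, since the excerpt does not specify them here.

```python
import torch.nn as nn

WIDTH = 1000

def lecun_normal_(layer: nn.Linear):
    # LeCun normal initialization: std = 1/sqrt(fan_in), preserving variance
    # in the absence of non-linearities.
    nn.init.normal_(layer.weight, mean=0.0, std=layer.in_features ** -0.5)
    nn.init.zeros_(layer.bias)

class ResidualBlock(nn.Module):
    def __init__(self, width: int):
        super().__init__()
        self.linear = nn.Linear(width, width)
        lecun_normal_(self.linear)
        # Normalized variant; drop this layer for the unnormalized model.
        self.norm = nn.BatchNorm1d(width)

    def forward(self, x):
        return x + self.norm(self.linear(x))

stem = nn.Linear(784, WIDTH)   # 784 is a placeholder input dimension
lecun_normal_(stem)
model = nn.Sequential(stem, *[ResidualBlock(WIDTH) for _ in range(16)])
```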




Review for NeurIPS paper: Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks

Neural Information Processing Systems

Weaknesses: * There might be multiple reasons that make BN networks trainable under extreme conditions, including large learning rates and very large depth. I agree with the point made by this work, that small initialization of the residual branches is one such reason, which in turn makes a vanilla ResNet without normalization trainable. However, it is possible that the normalized ResNet is trainable even without small initialization of the residual branches. It is well known that the input/output scale of the weights before batch normalization does not matter as much as it does for networks without normalization. For example, Li & Arora (2019) show that a slightly modified ResNet is trainable with an exponentially increasing LR and achieves performance equal to a step-decay schedule. The output of the residual blocks could also grow exponentially, but the network is still trainable because the gradients are small.
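The "small init in residual branches" idea referenced in this review can be sketched as follows (a SkipInit-style block written from the general description, not the paper's exact formulation): a learnable scalar gate, initialized at zero, multiplies the residual branch so that every block starts as the identity function.

```python
import torch
import torch.nn as nn

class SkipInitBlock(nn.Module):
    def __init__(self, width: int):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(), nn.Linear(width, width)
        )
        # Learnable scalar gate initialized at zero: the block starts as the
        # identity, which is one way to realize a small-init residual branch.
        self.gate = nn.Parameter(torch.zeros(()))

    def forward(self, x):
        return x + self.gate * self.branch(x)
```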


Radial Networks: Dynamic Layer Routing for High-Performance Large Language Models

Dotzel, Jordan, Akhauri, Yash, AbouElhamayed, Ahmed S., Jiang, Carly, Abdelfattah, Mohamed, Zhang, Zhiru

arXiv.org Artificial Intelligence

Large language models (LLMs) often struggle with strict memory, latency, and power demands. To meet these demands, various forms of dynamic sparsity have been proposed that reduce compute on an input-by-input basis. These methods improve over static methods by exploiting the variance across individual inputs, which has steadily grown with the exponential increase in training data. Yet, the increasing depth within modern models, currently with hundreds of layers, has opened opportunities for dynamic layer sparsity, which skips the computation for entire layers. In this work, we explore the practicality of layer sparsity by profiling residual connections and establish the relationship between model depth and layer sparsity. For example, the residual blocks in the OPT-66B model have a median contribution of 5% to its output. We then take advantage of this dynamic sparsity and propose Radial Networks, which perform token-level routing between layers guided by a trained router module. These networks can be used in a post-training distillation from sequential networks or trained from scratch to co-learn the router and layer weights. They enable scaling to larger model sizes by decoupling the number of layers from the dynamic depth of the network, and their design allows for layer reuse. By varying the compute token by token, they reduce the overall resources needed for generating entire sequences. Overall, this leads to larger capacity networks with significantly lower compute and serving costs for large language models.
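A hypothetical sketch of token-level layer routing in the spirit of the abstract (not the paper's actual Radial Network design): a small router scores each token, and the residual block is applied only to tokens whose score exceeds a threshold, while the rest pass through unchanged.

```python
import torch
import torch.nn as nn

class RoutedBlock(nn.Module):
    def __init__(self, d_model: int, threshold: float = 0.5):
        super().__init__()
        self.router = nn.Linear(d_model, 1)   # per-token compute/skip score
        self.branch = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.threshold = threshold

    def forward(self, x):                      # x: (batch, seq, d_model)
        keep = torch.sigmoid(self.router(x)) > self.threshold   # (batch, seq, 1) bool
        # For clarity the branch is computed densely here; a real implementation
        # would gather only the routed tokens to actually save compute.
        return torch.where(keep, x + self.branch(x), x)
```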


Optimal signal propagation in ResNets through residual scaling

Fischer, Kirsten, Dahmen, David, Helias, Moritz

arXiv.org Artificial Intelligence

Residual networks (ResNets) have significantly better trainability, and thus performance, than feed-forward networks at large depth. Introducing skip connections facilitates signal propagation to deeper layers. In addition, previous works found that adding a scaling parameter for the residual branch further improves generalization performance. While they empirically identified a particularly beneficial range of values for this scaling parameter, the associated performance improvement and its universality across network hyperparameters have yet to be understood. For feed-forward networks (FFNets), finite-size theories have led to important insights with regard to signal propagation and hyperparameter tuning. Here we derive a systematic finite-size theory for ResNets to study signal propagation and its dependence on the scaling of the residual branch. We derive analytical expressions for the response function, a measure of the network's sensitivity to inputs, and show that for deep networks the empirically found values for the scaling parameter lie within the range of maximal sensitivity. Furthermore, we obtain an analytical expression for the optimal scaling parameter that depends only weakly on other network hyperparameters, such as the weight variance, thereby explaining its universality across hyperparameters. Overall, this work provides a framework for theory-guided optimal scaling in ResNets and, more generally, the theoretical framework to study ResNets at finite widths.
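To make the role of the scaling parameter tangible, here is an illustrative Python probe (an assumed toy setup, not the paper's finite-size theory): it measures how strongly a deep residual stack amplifies a small input perturbation for different values of the residual-branch scaling, a rough empirical stand-in for the response function.

```python
import torch
import torch.nn as nn

def make_resnet(width, depth, alpha):
    # Residual stack x <- x + alpha * tanh(W x); weights drawn at construction time.
    blocks = [nn.Sequential(nn.Linear(width, width), nn.Tanh()) for _ in range(depth)]
    def forward(x):
        for branch in blocks:
            x = x + alpha * branch(x)
        return x
    return forward

torch.manual_seed(0)
width, depth, eps = 128, 50, 1e-3
x = torch.randn(1, width)
for alpha in (0.05, 0.2, 0.5, 1.0):
    net = make_resnet(width, depth, alpha)
    dx = eps * torch.randn_like(x)
    # Finite-difference estimate of the network's sensitivity to the input.
    sensitivity = (net(x + dx) - net(x)).norm() / dx.norm()
    print(f"alpha={alpha:.2f}  sensitivity={sensitivity.item():.2f}")
```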


How to Use Dropout Correctly on Residual Networks with Batch Normalization

Kim, Bum Jun, Choi, Hyeyeon, Jang, Hyeonah, Lee, Donggeon, Kim, Sang Woo

arXiv.org Artificial Intelligence

For the stable optimization of deep neural networks, regularization methods such as dropout and batch normalization have been used in various tasks. Nevertheless, the correct position at which to apply dropout has rarely been discussed, and different positions have been employed depending on the practitioner. In this study, we investigate the correct position to apply dropout. We demonstrate that for a residual network with batch normalization, applying dropout at certain positions increases the performance, whereas applying it at other positions decreases the performance. Based on theoretical analysis, we provide the following guideline for the correct position to apply dropout: apply one dropout after the last batch normalization but before the last weight layer in the residual branch. We provide detailed theoretical explanations to support this claim and demonstrate them through module tests. In addition, we investigate the correct position of dropout in the head that produces the final prediction. Although the current consensus is to apply dropout after global average pooling, we prove that applying dropout before global average pooling leads to a more stable output. The proposed guidelines are validated through experiments using different datasets and models.
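The placement guideline above can be written out directly. In the sketch below the pre-activation arrangement of the residual branch is our assumption, but the dropout sits exactly where the abstract recommends: after the last batch normalization and before the last weight layer.

```python
import torch.nn as nn

class ResidualBranchWithDropout(nn.Module):
    def __init__(self, channels: int, p: float = 0.1):
        super().__init__()
        self.branch = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Dropout(p),                                            # after the last BN ...
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),  # ... before the last weight layer
        )

    def forward(self, x):
        return x + self.branch(x)
```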